Vietnamese Word Segmentation

Authors

  • Dinh Dien
  • Kiem Hoang
  • Nguyen Van Toan
Abstract

Word segmentation is the first and obligatory task for every NLP system. For inflectional languages such as English, French, and Dutch, word boundaries are simply assumed to be whitespace or punctuation. In various Asian languages, however, including Chinese and Vietnamese, whitespace is never used to mark word boundaries, so one must resort to higher levels of information: morphology, syntax, and even semantics and pragmatics. In this paper, we present a model combining a WFST (Weighted Finite State Transducer) approach with a neural network. This word segmentation system is applied to Vietnamese text-to-speech and to a Vietnamese POS tagger. We evaluate its performance by comparing its segmentation results with a manually annotated corpus. Our algorithm achieves 97% accuracy on a corpus of Vietnamese Electronic Textbooks.
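To make the weighted-lattice idea behind a WFST segmenter concrete, here is a minimal sketch in Python. It is not the authors' model: the lexicon, its counts, the `UNKNOWN_COST` penalty, and the `MAX_WORD_LEN` limit are all hypothetical values chosen for illustration, and the neural network component mentioned in the abstract is not modeled. The sketch assigns each dictionary word a cost equal to its negative log relative frequency and uses dynamic programming to pick the cheapest path through the lattice of candidate words.

```python
import math

# Hypothetical toy lexicon: word (space-joined syllables) -> corpus count.
# These entries and counts are invented for illustration only.
LEXICON = {
    "học": 50,
    "sinh": 30,
    "học sinh": 40,
    "sinh học": 20,
    "giỏi": 25,
}
TOTAL = sum(LEXICON.values())
UNKNOWN_COST = 20.0  # assumed penalty for a syllable missing from the lexicon
MAX_WORD_LEN = 3     # assumed maximum number of syllables per word


def word_cost(word):
    """Return the negative log relative frequency, or None if the word is unknown."""
    count = LEXICON.get(word)
    if count is None:
        return None
    return -math.log(count / TOTAL)


def segment(syllables):
    """Return the lowest-cost segmentation of a list of syllables into words."""
    n = len(syllables)
    best = [math.inf] * (n + 1)  # best[i]: cheapest cost to cover syllables[:i]
    back = [0] * (n + 1)         # back[i]: start index of the last word ending at i
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = " ".join(syllables[start:end])
            cost = word_cost(word)
            if cost is None:
                if end - start > 1:
                    continue          # unknown multi-syllable spans are not words
                cost = UNKNOWN_COST   # an unknown single syllable stands alone
            total = best[start] + cost
            if total < best[end]:
                best[end] = total
                back[end] = start
    # Follow the back pointers from the end to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(" ".join(syllables[start:end]))
        end = start
    return words[::-1]
```

With this toy lexicon, `segment("học sinh giỏi".split())` returns `["học sinh", "giỏi"]`, because the two-syllable word is cheaper than the two single syllables taken separately. A real system would take its weights from corpus statistics and would need a further mechanism, such as the neural network the abstract mentions, to handle ambiguities that cost alone cannot settle.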

Similar resources

Word Segmentation of Vietnamese Texts: a Comparison of Approaches

We present in this paper a comparison between three segmentation systems for the Vietnamese language. The majority of Vietnamese words are built by semantic composition from about 7,000 syllables, which also have meaning as isolated words. The identification of word boundaries in a text is therefore not a simple task, and ambiguities often appear. Beyond the presentation of the tested systems,...

How does Dictionary Size Influence Performance of Vietnamese Word Segmentation?

Vietnamese word segmentation (VWS) is a challenging basic issue for natural language processing. This paper addresses how dictionary size influences VWS performance, proposes two novel measures, square overlap ratio (SOR) and relaxed square overlap ratio (RSOR), and validates their effectiveness. The SOR measure is the product of dictionary overlap ratio and corpus overlap ra...

Comparing Different Criteria for Vietnamese Word Segmentation

Syntactically annotated corpora have become important resources for natural language processing due in part to the success of corpus-based methods. Since words are often considered as primitive units of language structures, the annotation of word segmentation forms the basis of these corpora. This is also an issue for the Vietnamese Treebank (VTB), which is the first and only publicly available...

A method for word segmentation in Vietnamese

Word segmentation is the very first step in natural language processing for languages such as Vietnamese. Given that unannotated corpora are the only widely available resources, we propose a method of word segmentation for Vietnamese that uses only n-gram information. We calculate the probabilities of different combinations of n-grams in a chunk, and choose the one that produces max...

Using Search Engine to Construct a Scalable Corpus for Vietnamese Lexical Development for Word Segmentation

As the web content becomes more accessible to the Vietnamese community across the globe, there is a need to process Vietnamese query texts properly to find relevant information. The recent deployment of a Vietnamese translation tool on a well-known search engine justifies its importance in gaining popularity with the World Wide Web. There are still problems in the translation and retrieval of V...

Publication date: 2001